A Comparison of Information Concerning the Regression
Parameter in the Accelerated Failure Time Model
under Current Duration and Length Biased Sampling:

Does it Pay to be Patient? 

Bert van Es, Chris A.J. Klaassen, Philip J. Mokveld

Abstract: Longitudinal observations are sometimes costly or not available. Cross 
sectional sampling can be an alternative. Observations are drawn then at a specific 
point in time from a population of durations whose distributions satisfy a core 
model. Subsequently, one has a choice. One may process the data immediately, 
obtaining so called current duration data. Or one waits until the sampled durations 
are known completely obtaining the full durations via length biased sampling. We 
compare the Fisher information for the Euclidean parameter corresponding to an 
Accelerated Failure Time core model when the observations are obtained by either 
current duration or length biased sampling. 

1 Current duration and length biased sampling
from the AFT model

Two often used models in survival analysis based on longitudinal data are the 
Cox Proportional Hazards model (PH) and the Accelerated Failure Time (AFT) 
model. These two semiparametric models both have appealing interpretations and 
their properties are well understood. For instance information bounds and efficient 
estimators of the Euclidean regression parameter are available for both models. 

In situations where longitudinal observations are costly, or not available, one 
has to resort to technically more complicated but less costly sampling schemes, like 
cross sectional sampling. In a medical setting this means that instead of following a 
certain number of patients in time one selects the durations of the disease of a group 
of patients sampled at a specific point in time, obtaining a so called cross sectional 
sample. One then has a choice. Either one uses the data at hand at the time of 
sampling, i.e. the durations up to the present, obtaining so called current duration 
data, or one decides to wait until the full durations for the sampled patients are 
known. Because longer durations turn out to be sampled more frequently than 
shorter ones, the second type of sampling is known as length biased sampling. 

Let us compare the two cross sectional sampling regimes. Current duration 
sampling will only require knowledge of the duration up to the present and is 
thus very cheap in this sense. Length biased sampling requires the time needed to
observe the full durations of the diseases of the patients that have been sampled 
and is thus more costly than current duration sampling. 

We will assume that we sample from a population of durations that satisfy a 
semiparametric core model. By comparing information bounds for the Euclidean 
parameter under the two cross sectional sampling schemes we will investigate the 
gain in efficiency in being patient. 

Our comparison below is based on results for current duration and length biased 
sampling for the AFT core model in these situations, presented in Mokveld (2006). 
Similar results for the PH model do not exist at present. See also Van Es, Klaassen 
and Oudshoorn (2000) for some general features of current duration sampling. 

1.1 The core AFT model 

We first introduce the AFT core model. Let T denote a duration, for instance the 
duration of the disease of an individual from a homogeneous group of patients with 
a particular disease, and let W denote a vector of covariates of dimension k with 
density h with respect to a measure $\nu$. We do not assume knowledge of h. Let
$\theta \in \Theta$ denote an unknown k-vector of regression parameters.

The semiparametric AFT model for the random vector (T, W) is given by 

\[ T = e^{-\theta^T W} V, \tag{1} \]

where V is a nondegenerate random variable on $[0, \infty)$ with unknown absolutely
continuous distribution function $G_0$, with density $g_0$ and hazard function $\lambda_0$, and
where V and W are independent. We consider estimation of $\theta$, treating $g_0$ as a
nuisance parameter.

From the model equation (1) we can derive the conditional survival function
$\bar G_\theta(t|w)$, the conditional density $g_\theta(t|w)$ and the conditional hazard function
$\lambda_\theta(t|w)$ of T given W = w. We get, for t > 0,

\[
\begin{aligned}
\bar G_\theta(t|w) &= 1 - G_\theta(t|w) = \bar G_0(e^{\theta^T w} t), \\
g_\theta(t|w) &= e^{\theta^T w} g_0(e^{\theta^T w} t), \\
\lambda_\theta(t|w) &= e^{\theta^T w} \lambda_0(e^{\theta^T w} t).
\end{aligned}
\]

Note that given the value of the covariate vector the model is a scale model. The
function $\lambda_0$ serves as baseline hazard in this scale model. Depending on the value
of the scale $e^{\theta^T w}$, on average the duration is decreased or increased.
Also note that taking logarithms in the model equation (1) we get

\[ \ln T = -\theta^T W + \ln V, \]

showing that the AFT model is actually a regression model for the logarithm of the
duration. However, differences are caused by different natural assumptions on the
distributions of V in the AFT model and the error $\ln V$ in the regression model.

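The log-linear form of the AFT model can be checked by simulation. The sketch below is only an illustration under assumptions that are not part of the model above: a scalar covariate W with a standard normal distribution, a Weibull baseline V with shape 2, and the hypothetical helper names `simulate_aft` and `log_linear_slope`. It regresses ln T on W by least squares and compares the fitted slope with minus the regression parameter.

```python
import math
import random


def simulate_aft(theta, n, seed=0):
    """Draw n pairs (T, W) from T = exp(-theta * W) * V with scalar W."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        w = rng.gauss(0.0, 1.0)            # assumed covariate law (illustration only)
        v = rng.weibullvariate(1.0, 2.0)   # assumed baseline: Weibull, scale 1, shape 2
        pairs.append((math.exp(-theta * w) * v, w))
    return pairs


def log_linear_slope(pairs):
    """Least-squares slope of ln T on W."""
    n = len(pairs)
    mean_w = sum(w for _, w in pairs) / n
    mean_y = sum(math.log(t) for t, _ in pairs) / n
    sxy = sum((w - mean_w) * (math.log(t) - mean_y) for t, w in pairs)
    sxx = sum((w - mean_w) ** 2 for _, w in pairs)
    return sxy / sxx


theta = 0.7
slope = log_linear_slope(simulate_aft(theta, n=20_000))
print(f"-theta = {-theta:.3f}, fitted slope = {slope:.3f}")
```

The intercept absorbs the mean of ln V, so only the slope is informative about the regression parameter here. Least squares is used purely for illustration; it is not the efficient estimator discussed in this paper.
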
1.2 Current duration and length biased sampling

Let us assume that we observe the durations and their covariates at a specific point 
in time, the present. Let D denote the total length of a sampled duration and let 
X denote the time from onset until the present of a sampled duration. 

For simplicity we first describe the sampling distributions in the situation without
covariates. If f and F are the density and distribution function of the durations
T in the core model then under suitable assumptions the densities of D and X equal

\[ f_D(y) = \frac{y f(y)}{\mu}, \tag{2} \]

\[ f_X(x) = \frac{\bar F(x)}{\mu}, \tag{3} \]

where $\bar F(x) = 1 - F(x)$ and $\mu = \int_0^\infty u f(u)\,du$. It turns out that X is in distribution
equal to DU with U uniformly distributed on the unit interval and with D and U
independent. Hence, while formula (2) follows from the length bias in the sampling,
formula (3) follows from the same length bias in selecting the duration and from
multiplicative censoring, since at the present we only observe a fraction of the total
duration.

The formulas (2) and (3) require suitable models for the times of onset of the
disease. In Van Es, Klaassen and Oudshoorn (2000) and Mokveld (2006) two models 
for the times of onset are described that give rise to the densities above. 

One can follow a direct approach where the random variable L denotes the time
of onset and is uniformly distributed on the interval [-r, 0]. Subsequently one lets
r go to infinity. The duration T is assumed to be independent from L and current
duration sampling takes place at time zero. A duration is sampled if and only if
T > -L (random left truncation). The disease will have lasted X = -L at time
zero and will last D = T if we wait until recovery. The distributions of X and D
can be computed by conditioning on T > -L.
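The direct approach is easy to mimic by simulation. The following sketch assumes a core model without covariates in which T is standard exponential, a finite window length r = 20 in place of the limit, and a hypothetical helper `cross_sectional_sample`; none of these choices come from the paper. For this choice of T, the length biased density y f(y)/μ has mean 2, the current duration density (1 - F(x))/μ has mean 1, and the ratio X/D should behave like a uniform variable on the unit interval.

```python
import random


def cross_sectional_sample(draw_duration, r, n_onsets, seed=0):
    """Direct approach: onset L ~ Uniform[-r, 0], sampling at time zero.

    Returns the pairs (X, D) for the durations still ongoing at time zero.
    """
    rng = random.Random(seed)
    sample = []
    for _ in range(n_onsets):
        onset = rng.uniform(-r, 0.0)
        duration = draw_duration(rng)
        if duration > -onset:                  # random left truncation: T > -L
            sample.append((-onset, duration))  # X = -L, D = T
    return sample


# Assumed core model (illustration only): T ~ Exp(1), so mu = 1 and E T^2 = 2.
sample = cross_sectional_sample(
    lambda rng: rng.expovariate(1.0), r=20.0, n_onsets=400_000
)
n = len(sample)
mean_x = sum(x for x, _ in sample) / n
mean_d = sum(d for _, d in sample) / n
mean_ratio = sum(x / d for x, d in sample) / n
print(f"sampled {n}: mean X = {mean_x:.3f}, mean D = {mean_d:.3f}, "
      f"mean X/D = {mean_ratio:.3f}")
```

Only about one onset in twenty is retained with these settings, which is the price of left truncation. A larger r brings the simulation closer to the limiting densities at the cost of more rejected draws.
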

Following Keiding (1991) one can also follow a point process approach where
patients get ill at the time points of a stationary Poisson process with constant
intensity $\lambda$. The durations of their disease are modelled as i.i.d. random variables T
that are independent from the Poisson process and cross sectional sampling takes
place at some fixed point in time. By point process techniques one can show that
N, the number of durations that are sampled, has a Poisson distribution, and,
conditionally on N = n, the sampled times X from onset and full durations D are
i.i.d. with the densities (2) and (3).

In the regression setting with covariates we observe n i.i.d. realizations of (D, Z) 
or (X, Z) of durations (in total or from onset to present) and the sampled covariates. 
As mentioned above we consider the case where the density h of the covariate W in 
the core model is unknown. Under the AFT model assumptions for the core model, 
it turns out that given the covariate Z the distributions of both D and X belong 
to scale parameter families, just as the distribution of the original durations T in
the core model. In fact, they again follow an AFT model. The difference with the
core model is that now the distribution of Z, the observed covariate, depends on
the Euclidean parameter $\theta$. It does not depend on $g_0$.

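For x > 0, y > 0, and $z \in \mathbb{R}^k$ we have for the total duration D

\[
\begin{aligned}
f_{D,Z}(y,z) &= \frac{e^{\theta^T z}\, y\, g_0(e^{\theta^T z} y)\, h(z)}{E_{g_0}V \; E_h e^{-\theta^T W}}, \\
f_Z(z) &= \frac{e^{-\theta^T z}\, h(z)}{E_h e^{-\theta^T W}}, \\
f_{D|Z}(y|z) &= \frac{e^{2\theta^T z}\, y\, g_0(e^{\theta^T z} y)}{E_{g_0}V},
\end{aligned}
\tag{4}
\]

and for the duration from onset to present X

\[
\begin{aligned}
f_{X,Z}(x,z) &= \frac{\bar G_0(e^{\theta^T z} x)\, h(z)}{E_{g_0}V \; E_h e^{-\theta^T W}}, \\
f_Z(z) &= \frac{e^{-\theta^T z}\, h(z)}{E_h e^{-\theta^T W}}, \\
f_{X|Z}(x|z) &= \frac{e^{\theta^T z}\, \bar G_0(e^{\theta^T z} x)}{E_{g_0}V}.
\end{aligned}
\tag{5}
\]
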

These formulas hold under the direct approach or the point process approach for 
the times of onset described above. See Van Es, Klaassen and Oudshoorn (2000) 
or Mokveld (2006) for details. 



2 A comparison of information bounds 

We will present information bounds for estimation of the Euclidean parameter $\theta$
for cross sectional sampling from a core AFT model as derived in Mokveld (2006). 
Throughout, when we mention information we mean information contained in one 
observation. 

As above, primarily we consider the case where the covariate distribution is 
unknown. See Remark 2.1 for the case where this distribution is known.

2.1 Current duration and length biased sampling 

The covariance matrix of the sampled covariates appears in all information matrices 
below. It equals 

\[ \Sigma_Z = E (Z - EZ)(Z - EZ)^T. \]

Note that this matrix depends on $\theta$ through the distribution of Z.

Let us first define the Fisher information for scale $I_s(f)$ for a density f by

\[ I_s(f) = \int \left(1 + x\,\frac{f'(x)}{f(x)}\right)^2 f(x)\,dx. \tag{6} \]

With $\mu = \int x g_0(x)\,dx = \int \bar G_0(x)\,dx$, $f_1(x)$ equal to $x g_0(x)/\mu$, and $f_2(x)$ equal to
$\bar G_0(x)/\mu$, it is shown in Mokveld (2006) that efficient estimators of $\theta$ can be
constructed and that the information bounds are equal to

\[ \Sigma_Z\, I_s(f_1) \]

in the situation of length biased sampling where the full durations are observed,
and to

\[ \Sigma_Z\, I_s(f_2) \]

in the situation of current duration sampling where the durations from onset to
present are observed. Rewriting $I_s(f_1)$ and $I_s(f_2)$ in terms of $g_0$ we get

\[ I_s(f_1) = \int \left(2 + x\,\frac{g_0'(x)}{g_0(x)}\right)^2 \frac{x g_0(x)}{\mu}\,dx \]

and

\[ I_s(f_2) = \int \left(1 - \frac{x g_0(x)}{\bar G_0(x)}\right)^2 \frac{\bar G_0(x)}{\mu}\,dx. \]
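Both integrals are straightforward to approximate numerically once the baseline is specified. The sketch below is one possible way to do so, not a method from the paper: the function name `scale_informations`, its arguments, the integration range and the step count are all assumptions made for illustration. It substitutes t = ln x and applies the midpoint rule, and it takes the hazard rather than the density as input so that no division by a vanishing survival function occurs. The check uses an exponential baseline, for which $f_1$ is a Gamma density with shape 2 and $f_2$ equals the baseline itself, so the exact values are 2 and 1.

```python
import math


def scale_informations(survival, hazard, elasticity,
                       t_lo=-30.0, t_hi=30.0, steps=20_000):
    """Approximate (I_s(f_1), I_s(f_2)) for a baseline density g_0.

    survival(x)   = 1 - G_0(x)
    hazard(x)     = g_0(x) / survival(x)
    elasticity(x) = x * g_0'(x) / g_0(x)
    The integrals over x > 0 use the midpoint rule in t = ln x.
    """
    width = (t_hi - t_lo) / steps
    mu = num_1 = num_2 = 0.0
    for i in range(steps):
        x = math.exp(t_lo + (i + 0.5) * width)
        surv = survival(x)
        x_haz = x * hazard(x)          # x * g_0(x) / survival(x)
        dx = x * width                 # dx = x dt
        mu += surv * dx
        # x * g_0(x) is written as x_haz * surv
        num_1 += (2.0 + elasticity(x)) ** 2 * x_haz * surv * dx
        num_2 += (1.0 - x_haz) ** 2 * surv * dx
    return num_1 / mu, num_2 / mu


# Exponential baseline g_0(x) = exp(-x), chosen only as a test case.
info_lb, info_cd = scale_informations(
    survival=lambda x: math.exp(-x),
    hazard=lambda x: 1.0,
    elasticity=lambda x: -x,
)
print(f"length biased: {info_lb:.6f}, current duration: {info_cd:.6f}")
```

The default range is adequate for baselines whose integrands decay at least geometrically in ln x at both ends. Heavier tails or very large shape parameters would need a different range, and the latter can overflow floating point arithmetic.
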

Remark 2.1. Let us consider the model where the covariate distribution in the core
model is known. Then (4) and (5) show that the distribution of the covariates Z
in the sample is the same for current duration and length biased sampling, that it
does not depend on $g_0$, and that the Fisher information matrix in one observation
for $\theta$ based on the covariates in the sample alone is equal to $\Sigma_Z$. Under suitable
assumptions $\theta$ can be estimated $\sqrt{n}$-consistently from the covariates alone by, for
instance, the maximum likelihood estimator.

The information for $\theta$ based on durations and covariates now equals

\[ \Sigma_Z \left(I_s(f_1) + 1\right) \]

in the situation of length biased sampling where the full durations are observed,
and

\[ \Sigma_Z \left(I_s(f_2) + 1\right) \]

in the situation of current duration sampling where the durations from onset to
present are observed. These are obviously larger than in the situation where the
covariate distribution is unknown.

Note also that, using both durations and covariates in the sample, the semiparametric
information for $\theta$, with $g_0$ as nuisance parameter, under the two sampling
schemes, is larger than the information based on the covariates alone.

2.2 A comparison

The results in this section show that it pays to be patient. 

Theorem 2.2. Let g be an absolutely continuous density on $(0, \infty)$ with derivative
g' a.e. and let $\mu = \int x g(x)\,dx < \infty$. Let $f_1(x)$ be equal to $x g(x)/\mu$ and let $f_2(x)$ be
equal to $\bar G(x)/\mu$. If $I_s(f_1)$ and $I_s(f_2)$ are finite then

\[ I_s(f_2) < I_s(f_1) \tag{7} \]

holds.

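Proof. Note that $f_1$ is the density of $Y_1 = e^{\theta^T Z} D$ and that $f_2$ is the density of
$Y_2 = e^{\theta^T Z} X$. Since X = UD, with U independent of D and uniformly distributed
on the unit interval, we have, with $F_1$ the distribution function of $Y_1$,

\[ P(Y_2 \le x) = P(U Y_1 \le x) = \int_0^1 F_1\!\left(\frac{x}{u}\right) du. \]

So the relation between $f_1$ and $f_2$ can be expressed as

\[ f_2(x) = \int_0^1 f_1\!\left(\frac{x}{u}\right) \frac{1}{u}\,du. \tag{8} \]

By expanding the square in (6) we see that the inequality (7) holds if and only if

\[ \int x^2 \left(\frac{f_2'}{f_2}\right)^2\!(x)\, f_2(x)\,dx < \int x^2 \left(\frac{f_1'}{f_1}\right)^2\!(x)\, f_1(x)\,dx. \tag{9} \]

Let $f_1$ vanish at $x_0$ and be differentiable at $x_0$ with derivative $f_1'(x_0)$. Since $f_1$ is
nonnegative Lebesgue a.e., we get $f_1'(x_0) = 0$. Because an absolutely continuous
function is Lebesgue a.e. differentiable, this shows that $\{x : f_1(x) = 0,\ f_1'(x) \ne 0\}$
is a Lebesgue null set. Consequently, differentiating (8) under the integral sign, by
the Cauchy-Schwarz inequality we have

\[
\left(x f_2'(x)\right)^2
= \left( \int_0^1 \frac{x}{u}\, f_1'\!\left(\frac{x}{u}\right) \frac{1}{u}\,du \right)^2
\le \int_0^1 \left(\frac{x}{u}\right)^2 \left(\frac{f_1'}{f_1}\right)^2\!\left(\frac{x}{u}\right) f_1\!\left(\frac{x}{u}\right) \frac{1}{u}\,du
\;\int_0^1 f_1\!\left(\frac{x}{u}\right) \frac{1}{u}\,du.
\]

Hence by (8) we have

\[
\int x^2 \left(\frac{f_2'}{f_2}\right)^2\!(x)\, f_2(x)\,dx
\le \int \int_0^1 \left(\frac{x}{u}\right)^2 \left(\frac{f_1'}{f_1}\right)^2\!\left(\frac{x}{u}\right) f_1\!\left(\frac{x}{u}\right) \frac{1}{u}\,du\,dx
= \int z^2 \left(\frac{f_1'}{f_1}\right)^2\!(z)\, f_1(z)\,dz,
\]

which completes the proof of the inequality provided that we show that equality
can not occur.

The fact that the inequality (7) is strict can be seen as follows. The Cauchy-Schwarz
inequality holds with equality if and only if

\[ \frac{x}{u}\, f_1'\!\left(\frac{x}{u}\right) = c_x\, f_1\!\left(\frac{x}{u}\right) \]

for some constant $c_x$ and for all $u \in [0, 1]$. But for equality to hold in (9) this last
equality has to hold for all x. Now writing z = x/u this condition equals

\[ z f_1'(z) = c_x f_1(z) \]

for all x > 0 and all z > x, which can never hold: it would force $f_1$ to be proportional
to a power of z on $(0, \infty)$, and no such function is a probability density. □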



Actually, this theorem is a consequence of a more general inequality for Fisher 
information for scale for a product of random variables. 

Theorem 2.3. Let f be a density on $(0, \infty)$ that is absolutely continuous with
respect to Lebesgue measure with derivative f', such that $I_s(f)$, as defined by (6),
is finite. If G is an arbitrary distribution function on $(0, \infty)$ and the density h is
defined by

\[ h(x) = \int_0^\infty \frac{1}{u}\, f\!\left(\frac{x}{u}\right) dG(u), \]

then

\[ I_s(h) \le I_s(f) \]

with equality iff G is degenerate.

Proof. Let X be a random variable with density f. The random variable log X then has
density $\tilde f$ with $\tilde f(x) = e^x f(e^x)$. One may verify that the Fisher information
$I_s(f)$ for scale of f equals the Fisher information $I_l(\tilde f)$ for location of $\tilde f$. Furthermore,
h is the density of the product of X and an independent random variable with distribution
G. Consequently, with $\tilde h$ defined by $\tilde h(z) = e^z h(e^z)$ and $\tilde G(z)$ defined by $G(e^z)$, it
suffices to prove that $I_l(\tilde h) \le I_l(\tilde f)$ holds with equality iff $\tilde G$ is degenerate. However,
this inequality follows by Cauchy-Schwarz via

\[
I_l(\tilde h)
= \int \frac{\left\{ \int \frac{\tilde f'}{\tilde f}(x-y)\, \sqrt{\tilde f(x-y)}\, \sqrt{\tilde f(x-y)}\; d\tilde G(y) \right\}^2}{\tilde h(x)}\,dx
\le \int\!\!\int \left(\frac{\tilde f'}{\tilde f}\right)^2\!(x-y)\, \tilde f(x-y)\; d\tilde G(y)\,dx = I_l(\tilde f),
\]

as has been noticed by Hajek and Sidak (1967) in their Theorem 1.2.3 on page
17. □

Figure 1: Left: Weibull densities for $\gamma$ equal to 2 (solid), 5 (dotted) and 10 (dashed).
Right: information under length biased and current duration sampling (Weibull g)
as a function of $\gamma$.



2.2.1 Examples 

To get a feeling for the difference in information in the current duration and length
biased observations we consider two families of densities for the nuisance parameter
$g_0$, the Weibull densities and the log logistic densities.

First we consider the Weibull densities. Let $g_0$ be a Weibull density with
parameter $\gamma > 0$, i.e.

\[ g_0(t) = \gamma\, t^{\gamma - 1} e^{-t^\gamma}, \quad t > 0. \]

For these densities we have

\[ I_s(f_1) = \gamma(\gamma + 1), \qquad I_s(f_2) = \gamma. \]



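Next we consider log logistic densities $g_0$. Let $g_0$ be a log logistic density with
parameter $\gamma > 1$, i.e.

\[ g_0(t) = \frac{\gamma\, t^{\gamma - 1}}{(1 + t^\gamma)^2}, \quad t > 0. \]

For these densities we have

\[ I_s(f_1) = \tfrac{1}{3}(\gamma^2 - 1), \qquad I_s(f_2) = \tfrac{1}{2}(\gamma - 1). \]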



These two examples show that the more concentrated the density $g_0$ of the
random variable V in the model (1), corresponding with high parameter values $\gamma$,
the higher the gain in being patient.
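The gain can be made concrete by taking the ratio of the two informations. Since the covariance matrix of the sampled covariates is the same under both schemes, this ratio can be read as roughly the number of current duration observations that carry the information of one length biased observation. The sketch below evaluates it from the closed forms γ(γ + 1) and γ for the Weibull family and (γ² − 1)/3 and (γ − 1)/2 for the log logistic family; the function names are hypothetical and the parameter values 2, 5 and 10 are chosen for illustration.

```python
def weibull_informations(gamma):
    """Closed forms (I_s(f_1), I_s(f_2)) for the Weibull baseline, gamma > 0."""
    return gamma * (gamma + 1.0), gamma


def log_logistic_informations(gamma):
    """Closed forms (I_s(f_1), I_s(f_2)) for the log logistic baseline, gamma > 1."""
    return (gamma ** 2 - 1.0) / 3.0, (gamma - 1.0) / 2.0


def patience_gain(informations):
    """Ratio of length biased to current duration information."""
    length_biased, current_duration = informations
    return length_biased / current_duration


for gamma in (2.0, 5.0, 10.0):
    weibull = patience_gain(weibull_informations(gamma))
    log_logistic = patience_gain(log_logistic_informations(gamma))
    print(f"gamma = {gamma:4.1f}: Weibull gain = {weibull:.2f}, "
          f"log logistic gain = {log_logistic:.2f}")
```

Algebraically the ratio is γ + 1 for the Weibull family and 2(γ + 1)/3 for the log logistic family, so in both cases it grows linearly in the shape parameter.
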

Figure 2: Left: Log logistic densities for $\gamma$ equal to 2 (solid), 5 (dotted) and 10 (dashed).
Right: information under length biased and current duration sampling (log logistic
g) as a function of $\gamma$.